Week 1: Getting started with Anaconda, Jupyter Notebook and Python¶

Exercises to familiarise myself with Jupyter Notebook and its relationship to Python.¶

a) I joined this course because I am interested in the more practical aspects of DMIS. AI can be used as a tool to create and analyse information in many areas.

b) I have prior experience with Python and a little AI through the computer science classes I took last year.

c) I expect to learn:

  • how to use AI to help my productivity
  • how to code for machine learning
  • how to combine these into a practical skillset for a future career
In [1]:
message1 = "Hello, World!"

print(message1[0])
H
In [2]:
from IPython.display import *
YouTubeVideo("tK0vp8LlDiM")
Out[2]:
In [ ]:
import webbrowser #opens URLs in the default browser
import requests #makes HTTP requests

print("Shall we hunt down an old website?") #intro message
site = input("Type a website URL: ") #URL to look up
era = input("Type year, month, and date, e.g., 20150613: ") #timestamp to search near
url = "http://archive.org/wayback/available?url=%s&timestamp=%s" % (site, era) #build the Wayback Machine API query
response = requests.get(url) #call the availability API
data = response.json() #parse the JSON response
try:
    old_site = data["archived_snapshots"]["closest"]["url"] #URL of the closest archived snapshot
    print("Found this copy: ", old_site) #print the URL of the search result
    print("It should appear in your browser.")
    webbrowser.open(old_site)
except KeyError: #no snapshot found, so the keys above are missing
    print("Sorry, could not find the site.")
Shall we hunt down an old website?

Week 2. Exploring Data in Multiple Ways¶

In [1]:
from IPython.display import Image

Image("picture1.jpg")
Out[1]:
In [2]:
from IPython.display import Audio

Audio("audio1.mid")
Out[2]:
Your browser does not support the audio element.
In [3]:
Audio("GoldbergVariations_MehmetOkonsar-1of3_Var1to10.ogg")

#This file is licensed under the Creative Commons Attribution-Share Alike 3.0 Unported license.
#You are free: 
#•	to share – to copy, distribute and transmit the work
#•	to remix – to adapt the work
#Under the following conditions: 
#•	attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
#•	share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.
#The original ogg file was found at the url: 
#https://en.wikipedia.org/wiki/File:GoldbergVariations_MehmetOkonsar-1of3_Var1to10.ogg
Out[3]:
Your browser does not support the audio element.

Reflections¶

The .ogg file plays but the .mid does not. I believe the file type is not compatible; the browser's audio element must not have a codec for MIDI.
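One way to support this idea: `Audio` can play raw sampled sound passed in as a NumPy array, whereas a MIDI file holds note instructions rather than samples, so the browser has nothing to decode. A minimal sketch generating a playable tone (the frequency and sample rate are arbitrary choices):

```python
import numpy as np

# Build two seconds of a 440 Hz sine tone as raw sample values.
rate = 22050  # samples per second
t = np.linspace(0, 2, 2 * rate, endpoint=False)
wave = np.sin(2 * np.pi * 440 * t)

# In a notebook cell, this plays because Audio receives sampled data directly:
# from IPython.display import Audio
# Audio(wave, rate=rate)
print(wave.shape)  # (44100,)
```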

Task 3.2: Using matplotlib¶

In [4]:
from matplotlib import pyplot
test_picture = pyplot.imread("picture1.jpg")
print("Numpy array of the image is: ", test_picture)
pyplot.imshow(test_picture)
Numpy array of the image is:  [[[207 204 199]
  [208 205 200]
  [209 206 201]
  ...
  [206 207 212]
  [205 206 211]
  [205 206 211]]

 [[206 203 198]
  [207 204 199]
  [208 205 200]
  ...
  [206 207 212]
  [205 206 211]
  [205 206 211]]

 [[205 202 197]
  [206 203 198]
  [207 204 199]
  ...
  [206 207 212]
  [205 206 211]
  [205 206 211]]

 ...

 [[106  90  91]
  [111  97  97]
  [118 104 104]
  ...
  [193 193 203]
  [190 190 200]
  [190 190 200]]

 [[111  97  97]
  [115 101 101]
  [121 107 107]
  ...
  [193 193 203]
  [191 191 201]
  [191 191 201]]

 [[118 104 104]
  [121 107 107]
  [125 110 113]
  ...
  [190 190 200]
  [190 190 200]
  [190 190 200]]]
Out[4]:
<matplotlib.image.AxesImage at 0x117f342ddc0>
In [5]:
test_picture_filtered = 2*test_picture/3
pyplot.imshow(test_picture_filtered)
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Out[5]:
<matplotlib.image.AxesImage at 0x117f3cef220>

Discussion¶

I believe the RGB colour data is being manipulated by the multiplication and division: each channel value is scaled to two-thirds of its original value, which darkens the image.
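The clipping warning appears because the division turns the uint8 array into floats, which imshow expects in the 0..1 range. A minimal sketch of the same scaling that keeps valid integer values (the tiny stand-in array mimics the uint8 RGB data imread returned for the JPEG):

```python
import numpy as np

# A stand-in 2x2 RGB image with uint8 channel values, like pyplot.imread
# gives for a JPEG file.
picture = np.array([[[207, 204, 199], [208, 205, 200]],
                    [[106,  90,  91], [111,  97,  97]]], dtype=np.uint8)

# Scale to two-thirds in float, then cast back to uint8 so imshow
# interprets the result as 0..255 data with no clipping warning.
filtered = (2 * picture.astype(float) / 3).astype(np.uint8)

print(filtered[0, 0])  # each channel drops to two-thirds of its value
```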

Task 3-3: Exploring scikit-learn (a.k.a sklearn)¶

In [6]:
from sklearn import datasets
dir(datasets)
Out[6]:
['__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__getattr__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_arff_parser',
 '_base',
 '_california_housing',
 '_covtype',
 '_kddcup99',
 '_lfw',
 '_olivetti_faces',
 '_openml',
 '_rcv1',
 '_samples_generator',
 '_species_distributions',
 '_svmlight_format_fast',
 '_svmlight_format_io',
 '_twenty_newsgroups',
 'clear_data_home',
 'dump_svmlight_file',
 'fetch_20newsgroups',
 'fetch_20newsgroups_vectorized',
 'fetch_california_housing',
 'fetch_covtype',
 'fetch_kddcup99',
 'fetch_lfw_pairs',
 'fetch_lfw_people',
 'fetch_olivetti_faces',
 'fetch_openml',
 'fetch_rcv1',
 'fetch_species_distributions',
 'get_data_home',
 'load_breast_cancer',
 'load_diabetes',
 'load_digits',
 'load_files',
 'load_iris',
 'load_linnerud',
 'load_sample_image',
 'load_sample_images',
 'load_svmlight_file',
 'load_svmlight_files',
 'load_wine',
 'make_biclusters',
 'make_blobs',
 'make_checkerboard',
 'make_circles',
 'make_classification',
 'make_friedman1',
 'make_friedman2',
 'make_friedman3',
 'make_gaussian_quantiles',
 'make_hastie_10_2',
 'make_low_rank_matrix',
 'make_moons',
 'make_multilabel_classification',
 'make_regression',
 'make_s_curve',
 'make_sparse_coded_signal',
 'make_sparse_spd_matrix',
 'make_sparse_uncorrelated',
 'make_spd_matrix',
 'make_swiss_roll',
 'textwrap']

Dataset choices¶

I'm going to look at the wine and iris datasets because they sound interesting.

In [7]:
wine_data = datasets.load_wine()
print(wine_data.DESCR)
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
    ============================= ==== ===== ======= =====
                                   Min   Max   Mean     SD
    ============================= ==== ===== ======= =====
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0.98  3.88    2.29  0.63
    Flavanoids:                   0.34  5.08    2.03  1.00
    Nonflavanoid Phenols:         0.13  0.66    0.36  0.12
    Proanthocyanins:              0.41  3.58    1.59  0.57
    Colour Intensity:              1.3  13.0     5.1   2.3
    Hue:                          0.48  1.71    0.96  0.23
    OD280/OD315 of diluted wines: 1.27  4.00    2.61  0.71
    Proline:                       278  1680     746   315
    ============================= ==== ===== ======= =====

    :Missing Attribute Values: None
    :Class Distribution: class_0 (59), class_1 (71), class_2 (48)
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data

The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.

Original Owners: 

Forina, M. et al, PARVUS - 
An Extendible Package for Data Exploration, Classification and Correlation. 
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.

Citation:

Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science. 

.. topic:: References

  (1) S. Aeberhard, D. Coomans and O. de Vel, 
  Comparison of Classifiers in High Dimensional Settings, 
  Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of  
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Technometrics). 

  The data was used with many others for comparing various 
  classifiers. The classes are separable, though only RDA 
  has achieved 100% correct classification. 
  (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)) 
  (All results using the leave-one-out technique) 

  (2) S. Aeberhard, D. Coomans and O. de Vel, 
  "THE CLASSIFICATION PERFORMANCE OF RDA" 
  Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of 
  Mathematics and Statistics, James Cook University of North Queensland. 
  (Also submitted to Journal of Chemometrics).

In [8]:
wine_data.feature_names
Out[8]:
['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']
In [9]:
wine_data.target_names
Out[9]:
array(['class_0', 'class_1', 'class_2'], dtype='<U7')
In [10]:
wine_data.keys()
Out[10]:
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
In [11]:
iris_data = datasets.load_iris()
print(iris_data.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
In [12]:
from sklearn import datasets
import pandas

wine_data = datasets.load_wine()
wine_dataframe = pandas.DataFrame(data=wine_data['data'], columns = wine_data['feature_names'])
#wine_dataframe.head()
wine_dataframe.describe()
Out[12]:
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

Discussion¶

I believe the DataFrame constructor builds the table with the feature names as column headings; head() then displays its first five rows, while describe() computes summary statistics (count, mean, standard deviation, quartiles, min and max) for each column.
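This is easy to check on a tiny frame: head() only shows the first rows, while describe() derives the summary table from the column values. A minimal sketch (the two columns and their values are made up for illustration):

```python
import pandas as pd

# A toy DataFrame with two of the wine feature names as columns.
df = pd.DataFrame({"alcohol": [11.0, 13.0, 14.8],
                   "hue": [0.48, 0.96, 1.71]})

print(df.head(2))          # just the first two rows, nothing computed
summary = df.describe()    # count, mean, std, min, quartiles, max per column
print(summary.loc["mean"]) # mean alcohol is 38.8/3, mean hue is 1.05
```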

Task 3-5: Thinking about data bias¶

It is important to consider the data you are using to form your dataset, as bias is almost always present. Biased data can include or omit variables in ways that lead to inaccurate conclusions. Five common types:

  • Response or activity bias
    • Where the majority of data is produced by a minority of the population
    • Solved by debiasing techniques to improve fairness
  • Selection bias
    • Non-random subset of items presented to users (popular items at top left)
    • Solved by randomising choices
  • System drift
    • System changes that change how user interacts with system (Google adding new recommended searches)
    • Solved by factoring in these changes
  • Omitted variable
    • Key variable not recorded
    • Solved by taking this into account or collecting extra data before processing
  • Societal bias
    • Data created by humans can often have gender/race stereotypes embedded
    • Solved by debiasing techniques to improve fairness

Techniques to correct these depend on the type of bias and can be applied before, during or after the data is processed, often as part of an AI or machine learning pipeline.

Weighting data can minimise bias and simulate real-world populations.
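One simple form of this reweighting: if a group is under-represented in the sample relative to the real-world population, its examples get proportionally larger weights. A minimal sketch (the group labels and target proportions are invented for illustration):

```python
from collections import Counter

# Sampled group labels: group "b" is under-represented at 25%.
sample = ["a", "a", "a", "b"]
target = {"a": 0.5, "b": 0.5}  # proportions we believe hold in the population

counts = Counter(sample)
n = len(sample)
# weight = target share / observed share, computed per example
weights = [target[g] / (counts[g] / n) for g in sample]
print(weights)  # "a" examples get 2/3, the "b" example gets 2.0
```

Many scikit-learn estimators accept such weights through a `sample_weight` argument to `fit()`, so the under-represented group counts more during training.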

In [ ]: